library(nycflights13)
library(tidyverse)
filter()
filter(flights, arr_delay >= 120)
filter(flights, dest %in% c("IAH", "HOU"))
airlines
filter(flights, carrier %in% c("UA", "AA", "DL"))
filter(flights, month %in% c(7, 8, 9))
filter(flights, arr_delay > 120, dep_delay <= 0)
filter(flights, arr_delay >= 60, arr_delay - dep_delay < -30)
filter(flights, dep_time <= 600 | dep_time == 2400)
between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
between() tests whether each value lies between two boundary values, inclusive: it checks >= against the left boundary and <= against the right boundary. Inside filter(), it keeps the rows where the variable falls in that range.
filter(flights, between(month, 7, 9))
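As a quick check of the inclusive behaviour described above, the sketch below compares between() with the written-out comparison on a small made-up vector (x is illustrative only and not part of the flights data).

```r
library(dplyr)

# Hypothetical month values, chosen to straddle both boundaries
x <- c(6, 7, 8, 9, 10)

# between() includes both boundary values
between(x, 7, 9)

# The same test written with comparison operators
x >= 7 & x <= 9

# Both forms give the same logical vector
identical(between(x, 7, 9), x >= 7 & x <= 9)
```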
dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))
These rows are also missing dep_delay, arr_time, arr_delay and air_time. Given that they have all the scheduled details but are missing all actual flight data, these rows appear to represent cancelled flights.
NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA ^ 0
[1] 1
NA | TRUE
[1] TRUE
NA | FALSE # Counter-example
[1] NA
FALSE & NA
[1] FALSE
TRUE & NA # Counter-example
[1] NA
NA * 0
[1] NA
NA ^ 0 is not missing because any value raised to the power of zero equals 1, so the result is known without knowing the value. NA * 0 stays missing because the same is not true of multiplication: the unknown value could be Inf, and Inf * 0 is NaN rather than 0. NA | TRUE is not missing because only one side of the ‘or’ operator needs to be TRUE (conversely, note that NA | FALSE is missing). FALSE & NA is not missing because both sides of the ‘and’ operator must be TRUE for the result to be TRUE, so the FALSE on the left-hand side makes the result FALSE whatever the NA stands for (conversely, note that TRUE & NA is missing). The general rule: an operation involving NA returns a non-missing result only when that result would be the same for every possible value of the NA.
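One way to test this reasoning is to substitute each possible value for the NA and see whether the answer changes. The base R sketch below does that for the logical cases and shows why multiplication by zero behaves differently; it is an illustration of the idea, not a statement of how R implements NA handling internally.

```r
# If every possible value gives the same answer, the result is not missing
c(TRUE | TRUE, FALSE | TRUE)      # both TRUE, matching NA | TRUE
c(FALSE & TRUE, FALSE & FALSE)    # both FALSE, matching FALSE & NA

# If the answer depends on the value, the result stays missing
c(TRUE | FALSE, FALSE | FALSE)    # TRUE and FALSE differ, matching NA | FALSE

# Anything to the power of zero is 1, even Inf
Inf ^ 0

# Multiplying by zero is not always zero for doubles
Inf * 0                           # NaN, so NA * 0 cannot be assumed to be 0
is.na(NA * 0)
```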
arrange()
arrange() to sort all missing values to the start? (Hint: use is.na()).
Using dep_time as an example:
arrange(flights, desc(is.na(dep_time)))
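The call above works because is.na() returns TRUE for missing rows, and desc() sorts TRUE ahead of FALSE. A minimal sketch on a made-up four-row tibble (toy and its values are hypothetical) makes the reordering easy to see:

```r
library(dplyr)

# Hypothetical data: rows 2 and 4 have a missing dep_time
toy <- tibble(id = 1:4, dep_time = c(517, NA, 533, NA))

# Rows where is.na(dep_time) is TRUE are moved to the top
arrange(toy, desc(is.na(dep_time)))

# For comparison, a plain arrange() leaves missing values at the end
arrange(toy, dep_time)
```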
arrange(flights, desc(arr_delay), dep_delay)
arrange(flights, air_time)
arrange(flights, desc(distance))
arrange(flights, distance)
select()
dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, 4, 6, 7, 9)
select(flights, starts_with("dep_"), starts_with("arr_"))
Other approaches are possible, such as using the minus operator to drop all of the other columns.
select() call?
select(flights, dep_time, dep_time)
The variable is not duplicated: it appears only once in the result.
one_of() function do? Why might it be helpful in conjunction with this vector?
It allows selection of variables by matching against a vector of strings. In the code below I’ve used it to select all of the variables that aren’t listed in the vector.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, -one_of(vars))
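To see both directions of one_of() side by side, the sketch below uses a small made-up tibble (toy is hypothetical and stands in for flights) with the same vars vector: once to keep the listed columns and once to drop them.

```r
library(dplyr)

# Hypothetical one-row stand-in for flights
toy <- tibble(year = 2013, month = 1, day = 1, carrier = "UA",
              dep_delay = 2, arr_delay = 11)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# Keep only the columns named in the character vector
names(select(toy, one_of(vars)))

# Drop the columns named in the character vector
names(select(toy, -one_of(vars)))
```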
select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
By default the select helpers are case-insensitive. This can be changed by passing the argument ignore.case = FALSE.
select(flights, contains("TIME", ignore.case = FALSE))
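Because every column name in flights is lower case, the difference is easier to see on data with mixed-case names. The sketch below assumes a made-up tibble (toy and its column names are hypothetical) and compares the default with ignore.case = FALSE.

```r
library(dplyr)

# Hypothetical data with one lower-case and one upper-case "time" column
toy <- tibble(dep_time = 517, ARR_TIME = 830, carrier = "UA")

# Default: matching ignores case, so both time columns are selected
names(select(toy, contains("time")))

# Case-sensitive: only the lower-case column matches
names(select(toy, contains("time", ignore.case = FALSE)))
```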